67 research outputs found

Discovering Reliable Dependencies from Data: Hardness and Improved Algorithms

Authors: Boley Mario; Mandros Panagiotis; Vreeken Jilles
Publication date: 01/01/2018

The reliable fraction of information is an attractive score for quantifying (functional) dependencies in high-dimensional data. In this paper, we systematically explore the algorithmic implications of using this measure for optimization. We show that the problem is NP-hard, which justifies the usage of worst-case exponential-time as well as heuristic search methods. We then substantially improve the practical performance for both optimization styles by deriving a novel admissible bounding function that has an unbounded potential for additional pruning over the previously proposed one. Finally, we empirically investigate the approximation ratio of the greedy algorithm and show that it produces highly competitive results in a fraction of the time needed for complete branch-and-bound style search.
Comment: Accepted to the Proceedings of the IEEE International Conference on Data Mining (ICDM'18).

arXiv.org e-Print Archive

CISPA – Helmholtz-Zentrum für Informationssicherheit

Crossref

MPG.PuRe

The Efficient Discovery of Interesting Closed Pattern Collections

Author: Boley Mario
Publication venue: Universitäts- und Landesbibliothek Bonn

Enumerating closed sets that are frequent in a given database is a fundamental data mining technique that is used, e.g., in the context of market basket analysis, fraud detection, or Web personalization. There are two complementary reasons for the importance of closed sets, one semantic and one algorithmic: closed sets provide a condensed basis for non-redundant collections of interesting local patterns, and they can be enumerated efficiently. For many databases, however, even the closed set collection can be far too large for further usage, and correspondingly its computation time can be infeasibly long. In such cases, it is inevitable to focus on smaller collections of closed sets, and it is essential that these collections retain both controlled semantics, reflecting some notion of interestingness, and efficient enumerability. This thesis discusses three different approaches to achieve this: constraint-based closed set extraction, pruning by quantifying the degree or strength of closedness, and controlled random generation of closed sets instead of exhaustive enumeration. For the original closed set family, efficient enumerability results from the fact that there is an inducing, efficiently computable closure operator and that its fixpoints can be enumerated by an amortized polynomial number of closure computations. Perhaps surprisingly, it turns out that this connection does not generally hold for other constraint combinations, as the restricted domains induced by additional constraints can cause two things to happen: the fixpoints of the closure operator cannot be enumerated efficiently, or an inducing closure operator does not even exist. This thesis gives, for the first time, a formal axiomatic characterization of constraint classes that allow the efficient enumeration of fixpoints of arbitrary closure operators, as well as of constraint classes that guarantee the existence of a closure operator inducing the closed sets.
As a complementary approach, the thesis generalizes the notion of closedness by quantifying its strength, i.e., the difference in supporting database records between a closed set and all its supersets. This gives rise to a measure of interestingness that is able to select long and thus particularly informative closed sets that are robust against noise and dynamic changes. Moreover, this measure is algorithmically sound because all closed sets with a minimum strength again form a closure system that can be enumerated efficiently and that directly ties into the results on constraint-based closed sets. In fact, both approaches can easily be combined. In some applications, however, the resulting set of constrained closed sets is still intractably large, or it is too difficult to find meaningful hard constraints at all (including values for their parameters). Therefore, the last part of this thesis presents an alternative algorithmic paradigm to the extraction of closed sets: instead of exhaustively listing a potentially exponential number of sets, randomly generate exactly the desired amount of them. By using the Markov chain Monte Carlo method, this generation can be performed according to any desired probability distribution that favors interesting patterns. This novel randomized approach complements traditional enumeration techniques (including those mentioned above): on the one hand, it is only applicable in scenarios that do not require deterministic guarantees for the output, such as exploratory data analysis or global model construction. On the other hand, random closed set generation provides complete control over the number as well as the distribution of the produced sets.

Enumerating closed sets that occur frequently in a given database is a fundamental algorithmic task in data mining that arises, e.g., in market basket analysis, fraud detection, or Web personalization. The importance of closed sets is both semantically and algorithmically motivated: they form a non-redundant basis for generating local patterns, and at the same time they can be enumerated efficiently. However, the number of all closed sets, and thus the time needed to list them, can often considerably exceed what can be handled in practice. In this case it is unavoidable to consider smaller output families, and it is essential that both of the above properties are preserved: controlled semantics in the sense of a suitable notion of interestingness, as well as efficient enumerability. This thesis presents three approaches to this end: introducing additional constraints, quantifying closedness, and the controlled random generation of individual sets instead of complete enumeration. The efficient enumerability of the original family of closed sets stems from the fact that it is generated by an efficiently computable closure operator and that, furthermore, its fixpoints can be enumerated with an amortized polynomially bounded number of closure computations. As it turns out, this connection no longer holds in general when the function domain is restricted by constraints; that is, efficient enumeration of the fixpoints is no longer possible, or a generating closure operator may not exist at all. This thesis gives, for the first time, an axiomatic characterization of constraint classes that allow the efficient fixpoint enumeration of arbitrary closure operators, as well as of constraint classes that guarantee the existence of a generating closure operator. As a complementary approach, the dissertation presents a generalization, or quantification, of the notion of closedness that is based on the difference between the database occurrences of a set and the occurrences of all its supersets.
Sets that are strongly closed with respect to this notion exhibit a certain robustness against changes in the input data. Furthermore, the desired efficient enumerability is again ensured by the existence of an efficiently computable generating closure operator. In addition to this algorithmic parallel to the constraint-based approach, the two approaches can also be combined in substance. In some applications, however, the family of closed sets to which the two approaches above lead is still too large, or it is not possible to find meaningful hard constraints and associated parameter values. Therefore, this thesis finally discusses a completely different paradigm for generating closed sets than complete listing, namely the randomized generation of a number of sets that matches exactly the desired specification. By using the Markov chain Monte Carlo method, it is possible to control the distribution of this random generation so that drawing interesting sets is favored. This new approach is a useful complement to conventional techniques (including those mentioned above): although it is only applicable when no deterministic guarantees are required, it allows complete control over the number and distribution of the produced sets.
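
To make the notion of a closure operator and its fixpoints concrete, here is a minimal sketch in Python. It is not the thesis's algorithm: it uses the standard itemset closure (the intersection of all transactions that contain a given set) and a simple search that repeatedly adds one item and closes, with duplicate detection via a set. The function names and the toy database are assumptions made for illustration.

```python
def support(db, itemset):
    """Return the transactions of db that contain every item of itemset."""
    return [t for t in db if itemset <= t]


def closure(db, itemset):
    """Standard itemset closure: intersect all transactions containing itemset.

    Convention assumed here: a set with no supporting transaction
    closes to the set of all items.
    """
    covering = support(db, itemset)
    if not covering:
        return frozenset().union(*db)
    return frozenset.intersection(*covering)


def closed_sets(db):
    """Enumerate all closed sets with non-empty support.

    Simple illustrative search: start from the closure of the empty set,
    then repeatedly add one item and close again. Every supported closed
    set is reachable this way because the closure is monotone.
    """
    items = frozenset().union(*db)
    start = closure(db, frozenset())
    found = {start}
    frontier = [start]
    while frontier:
        current = frontier.pop()
        for item in items - current:
            candidate = closure(db, current | {item})
            if candidate not in found and support(db, candidate):
                found.add(candidate)
                frontier.append(candidate)
    return found


# Toy transaction database (hypothetical data).
db = [frozenset("abc"), frozenset("ab"), frozenset("ac"), frozenset("b")]
result = closed_sets(db)
```

In this toy database the set {c} is not closed, because every transaction containing c also contains a; its closure is {a, c}. A fixpoint of `closure` is exactly a set that cannot be extended without losing supporting transactions.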

bonndoc – Der Publikationsserver der Universität Bonn

Mining Interesting Patterns in Multi-Relational Data

Authors: De Bie Tijl; Boley Mario; Spyropoulou Eirini
Publication venue: University of Bristol
Publication date: 01/01/2013

Explore Bristol Research

Effective parallelisation for machine learning

Authors: Boley Mario; Gärtner Thomas; Kamp Michael; Missura Olana
Publication venue: Massachusetts Institute of Technology Press
Publication date: 01/01/2017

We present a novel parallelisation scheme that simplifies the adaptation of learning algorithms to growing amounts of data as well as growing needs for accurate and confident predictions in critical applications. In contrast to other parallelisation techniques, it can be applied to a broad class of learning algorithms without further mathematical derivations and without writing dedicated code, while at the same time maintaining theoretical performance guarantees. Moreover, our parallelisation scheme is able to reduce the runtime of many learning algorithms to polylogarithmic time on quasi-polynomially many processing units. This is a significant step towards a general answer to an open question [21] on efficient parallelisation of machine learning algorithms in the sense of Nick's Class (NC). The cost of this parallelisation takes the form of a larger sample complexity. Our empirical study confirms the potential of our parallelisation scheme with fixed numbers of processors and instances in realistic application scenarios.

Nottingham ePrints

arXiv.org e-Print Archive

Nottingham eTheses

Fraunhofer-ePrints

MPG.PuRe

Bayes beats Cross Validation: Efficient and Accurate Ridge Regression via Expectation Maximization

Authors: Boley Mario; Schmidt Daniel F.; Tew Shu Yu
Publication date: 02/11/2023

We present a novel method for tuning the regularization hyper-parameter, $\lambda$, of a ridge regression that is faster to compute than leave-one-out cross-validation (LOOCV) while yielding estimates of the regression parameters of equal, or particularly in the setting of sparse covariates, superior quality to those obtained by minimising the LOOCV risk. The LOOCV risk can suffer from multiple and bad local minima for finite $n$ and thus requires the specification of a set of candidate $\lambda$, which can fail to provide good solutions. In contrast, we show that the proposed method is guaranteed to find a unique optimal solution for large enough $n$, under relatively mild conditions, without requiring the specification of any difficult to determine hyper-parameters. This is based on a Bayesian formulation of ridge regression that we prove to have a unimodal posterior for large enough $n$, allowing for both the optimal $\lambda$ and the regression coefficients to be jointly learned within an iterative expectation maximization (EM) procedure. Importantly, we show that by utilizing an appropriate preprocessing step, a single iteration of the main EM loop can be implemented in $O(\min(n, p))$ operations, for input data with $n$ rows and $p$ columns. In contrast, evaluating a single value of $\lambda$ using fast LOOCV costs $O(n \min(n, p))$ operations when using the same preprocessing. This advantage amounts to an asymptotic improvement of a factor of $l$ for $l$ candidate values for $\lambda$ (in the regime $q, p \in O(\sqrt{n})$ where $q$ is the number of regression targets).
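
For readers unfamiliar with the LOOCV baseline mentioned in this abstract, the following Python sketch shows the standard leverage shortcut for the leave-one-out risk of ridge regression and a search over a candidate grid of $\lambda$ values. It is not the paper's EM procedure. It assumes a single covariate, no intercept, and $\lambda > 0$ to stay dependency-free; the function names and the data are hypothetical.

```python
def ridge_loocv_risk(xs, ys, lam):
    """Fast LOOCV risk for one-covariate ridge regression (no intercept).

    Uses the identity: leave-one-out residual = residual / (1 - leverage),
    so the model is fitted only once per value of lam. Assumes lam > 0.
    """
    a = sum(x * x for x in xs) + lam               # regularized Gram "matrix" (a scalar here)
    beta = sum(x * y for x, y in zip(xs, ys)) / a  # ridge coefficient on all data
    total = 0.0
    for x, y in zip(xs, ys):
        leverage = x * x / a                       # diagonal entry of the hat matrix
        total += ((y - x * beta) / (1.0 - leverage)) ** 2
    return total / len(xs)


def ridge_loocv_risk_bruteforce(xs, ys, lam):
    """Reference implementation: refit the model n times, once per held-out row."""
    n = len(xs)
    total = 0.0
    for i in range(n):
        sxx = sum(x * x for j, x in enumerate(xs) if j != i) + lam
        sxy = sum(x * y for j, (x, y) in enumerate(zip(xs, ys)) if j != i)
        total += (ys[i] - xs[i] * sxy / sxx) ** 2
    return total / n


def best_lambda(xs, ys, candidates):
    """Pick the candidate lam with the smallest LOOCV risk (grid search)."""
    return min(candidates, key=lambda lam: ridge_loocv_risk(xs, ys, lam))


# Hypothetical toy data and candidate grid.
xs = [0.5, 1.0, 1.5, 2.0, 2.5]
ys = [1.1, 1.9, 3.2, 3.9, 5.1]
candidates = [0.01, 0.1, 1.0, 10.0]
chosen = best_lambda(xs, ys, candidates)
```

The grid search makes the dependence on the candidate set explicit: the result can only be one of the supplied values, which is the limitation the abstract contrasts with a method that needs no such grid.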

arXiv.org e-Print Archive

Discovering Dependencies with Reliable Mutual Information

Authors: Boley Mario; Mandros Panagiotis; Vreeken Jilles
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2020

CISPA – Helmholtz-Zentrum für Informationssicherheit

MPG.PuRe

Monash University Research Portal

Discovering Reliable Dependencies from Data: Hardness and Improved Algorithms (Extended Abstract)

Authors: Boley Mario; Mandros Panagiotis; Vreeken Jilles
Publication venue: IJCAI
Publication date: 01/01/2019

Finding (functional) dependencies between attributes in databases is a well-known problem with applications in knowledge discovery, feature selection, and database management. While the recently introduced reliable fraction of information measure allows one to soundly quantify dependence in a way that avoids overfitting when optimizing over high-dimensional spaces, the algorithmic implications of using this score have not yet been systematically explored. This includes the computational complexity of the resulting optimization problem. To this end, this paper provides the following contributions: We show that the problem of maximizing the reliable fraction of information is NP-hard, which justifies the usage of worst-case exponential-time as well as heuristic search methods that do not guarantee optimal solutions. We then greatly improve the practical performance for both of these optimization styles by deriving a novel admissible bounding function, which has an unbounded potential for additional pruning over the previously proposed one. Finally, we empirically investigate for the first time the approximation ratio of the greedy algorithm and show that it in fact produces highly competitive results in a fraction of the time needed for complete branch-and-bound style search. All findings are evaluated on a wide range of real-world datasets that are publicly available along with the implementation of the algorithmic contributions. Our results suggest that in scenarios where no hard optimality guarantees are required, greedy optimization is a good alternative to branch-and-bound for dependency discovery. Also, the definition of the tighter bounding function is potentially more generally applicable than just to the reliable fraction of information and might be transferable to other dependency measures.
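
The contrast between greedy and complete search described in this abstract can be sketched in Python as follows. This is a simplified illustration under stated assumptions: it scores attribute sets with the plain (uncorrected) fraction of information, $1 - H(Y \mid X)/H(Y)$, and omits both the reliability correction and the bounding function from the paper. The complete search is a brute-force scan of all subsets rather than branch-and-bound. All names and the toy data are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import log2


def entropy(values):
    """Empirical Shannon entropy (in bits) of a list of values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def fraction_of_information(rows, attrs, target):
    """Plain fraction of information 1 - H(Y|X)/H(Y); no bias correction."""
    h_y = entropy([r[target] for r in rows])
    if h_y == 0 or not attrs:
        return 0.0
    groups = {}
    for r in rows:
        groups.setdefault(tuple(r[a] for a in attrs), []).append(r[target])
    h_cond = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return 1.0 - h_cond / h_y


def greedy_search(rows, candidates, target, eps=1e-12):
    """Add the best single attribute until the score stops improving."""
    chosen, best = (), 0.0
    while True:
        options = [(fraction_of_information(rows, chosen + (a,), target), a)
                   for a in candidates if a not in chosen]
        if not options:
            break
        score, attr = max(options, key=lambda t: t[0])
        if score <= best + eps:
            break
        chosen, best = chosen + (attr,), score
    return tuple(sorted(chosen)), best


def exhaustive_search(rows, candidates, target, eps=1e-12):
    """Score every non-empty subset; keeps the smallest among tied optima."""
    best_set, best = (), 0.0
    for k in range(1, len(candidates) + 1):
        for subset in combinations(candidates, k):
            score = fraction_of_information(rows, subset, target)
            if score > best + eps:
                best_set, best = subset, score
    return best_set, best


# Toy data: columns 0, 1, 2 are binary attributes, column 3 is Y = A and B.
rows = [(a, b, c, int(a and b)) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
greedy_result = greedy_search(rows, (0, 1, 2), target=3)
exact_result = exhaustive_search(rows, (0, 1, 2), target=3)
```

On this toy data both searches select attributes 0 and 1, but the exhaustive scan evaluates every subset while the greedy search evaluates only a handful. Without a correction term, the plain score never decreases when attributes are added, which is the overfitting tendency that a reliable score is meant to counteract.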

CISPA – Helmholtz-Zentrum für Informationssicherheit